
Russian parametric corpus RuParam

Abstract
The main function of large language models is to simulate the behavior of native speakers as accurately as possible. Hence, assessment datasets are needed to track progress on this problem and to regularly compare competing models with each other. Datasets of this type exist: the so-called linguistic acceptability corpora. The hypothesis underlying these corpora is that large language models, like native speakers, should be able to distinguish correct, grammatical sentences from ungrammatical ones that violate the grammar of the target language. This paper presents a parametric corpus for Russian, RuParam. The corpus contains 9.5 thousand minimal pairs of sentences that differ in grammaticality: each correct sentence corresponds to a minimally different erroneous one. The source of ungrammaticality in each pair is annotated by expert linguists. RuParam consists of two parts. The first part draws on a data source that is entirely new for the task of testing large language models: lexical and grammatical tests in Russian as a foreign language (RFL). The second part consists of (modified and annotated) examples from real texts that represent grammatical phenomena not included in the RFL curriculum due to their complexity. As our experiments with various large language models have shown, the highest results are achieved by models trained on Russian most carefully at all stages, from data preparation and tokenization to instruction tuning and reinforcement learning (primarily YandexGPT and GigaChat). Multilingual models, which usually place little or no emphasis on Russian, showed significantly lower results. Still, even the best models' results fall far short of human assessors, who completed the task with nearly 100% accuracy. The model ranking obtained in the experiments shows that our corpus reflects the actual degree of proficiency in Russian.
The resulting rating can be helpful when choosing a model for natural language processing tasks that require grammatical knowledge, such as building morphological and syntactic parsers. The proposed corpus can also be used to test one's own models.
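The evaluation scheme described above (does a model prefer the grammatical member of each minimal pair?) can be sketched in a few lines. This is a minimal illustration, not the paper's actual harness: the function names, the toy scorer, and the example pair are assumptions for demonstration; a real evaluation would score each sentence by its summed token log-probabilities under the model being tested.

```python
from typing import Callable, List, Tuple

def minimal_pair_accuracy(
    score: Callable[[str], float],
    pairs: List[Tuple[str, str]],
) -> float:
    """Fraction of pairs where the grammatical sentence scores higher.

    `score` can be any sentence-level scorer; in a real evaluation it
    would return the sum of token log-probabilities under a language
    model. Ties count as errors, since the model failed to prefer the
    grammatical variant.
    """
    if not pairs:
        return 0.0
    hits = sum(1 for good, bad in pairs if score(good) > score(bad))
    return hits / len(pairs)

# Illustrative stand-in for an LLM scorer: it merely recognizes one
# known-grammatical sentence. A real run would query model logits.
GRAMMATICAL = {"Дети читают книгу."}  # "The children are reading a book."

def toy_score(sentence: str) -> float:
    return 0.0 if sentence in GRAMMATICAL else -1.0

# One minimal pair: subject-verb agreement (читают vs. *читает).
pairs = [("Дети читают книгу.", "Дети читает книгу.")]
print(minimal_pair_accuracy(toy_score, pairs))  # → 1.0
```

The accuracy metric is deliberately simple: a model at chance level scores around 0.5, while the human assessors mentioned above approach 1.0.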